Astronomy & Astrophysics manuscript no. gamma ray'DBSCAN 


©ESO 2012 


October 2, 2012 





γ-ray DBSCAN: a clustering algorithm applied to
Fermi-LAT γ-ray data.

I: Detection performances with real and simulated data 

A. Tramacere¹ and C. Vecchio²

¹ ISDC, University of Geneva, Chemin d'Ecogia 16, Versoix, CH-1290, Switzerland; e-mail: andrea.tramacere@unige.ch
² Politecnico di Milano, Piazza L. da Vinci 32, 20133 Milano, Italy

Received; accepted 

ABSTRACT 

Context. The Density Based Spatial Clustering of Applications with Noise (DBSCAN) is a topometric algorithm used to cluster
spatial data that are affected by background noise. For the first time, we propose the use of this method for the detection of sources in
γ-ray astrophysical images obtained from the Fermi-LAT data, where each point corresponds to the arrival direction of a photon.
Aims. We investigate the detection performance of the γ-ray DBSCAN in terms of detection efficiency and rejection of spurious
clusters.

Methods. We use a parametric approach, exploring a large volume of the γ-ray DBSCAN parameter space. By means of simulated
data we statistically characterize the γ-ray DBSCAN, finding signatures that differentiate purely random fields from fields with
sources. We define a significance level for the detected clusters, and we successfully test this significance with our simulated data. We
apply the method to real data, and we find excellent agreement with the results obtained with simulated data.
Results. We find that the γ-ray DBSCAN can be successfully used in the detection of clusters in γ-ray data. The significance returned
by our algorithm is strongly correlated with that provided by the Maximum Likelihood analysis with standard Fermi-LAT software,
and can be used to safely remove spurious clusters. The positional accuracy of the reconstructed cluster centroid is comparable to that
returned by standard Maximum Likelihood analysis, which makes it possible to look for astrophysical counterparts in narrow regions,
minimizing the chance probability in the counterpart association.

Conclusions. We find that the γ-ray DBSCAN is a powerful tool for the detection of clusters in γ-ray data. The method can be used
to look for both point-like and extended sources, and can potentially be applied to any astrophysical field related to the detection of
clusters in data. In a companion paper we will present the application of the γ-ray DBSCAN to the full Fermi-LAT sky, discussing
its potential for the discovery of new sources.

Key words. Gamma rays: general - Methods: statistical - Methods: data analysis



1. Introduction 

Modern γ-ray telescopes operating at energies above the MeV
window provide event-resolved observational data. Each event
(after the reconstruction process) is typically described by a tuple
(i.e. an ordered list of elements) storing sky coordinates,
arrival time, and energy. Detection of discrete sources (either
point-like or extended) is performed using various methods.
Given the discrete topological nature of γ-ray images, methods
based on cluster search, like the Minimum Spanning Tree
(MST) (Campana et al. 2007, 2012), have been used successfully.
One of the main advantages of topometric methods, compared
to methods using spatial binning, is that they minimize the impact of
the poor, energy-dependent Point Spread Function (PSF) typical
of γ-ray telescopes, preserving the spatial information of each
event. Moreover, these methods are able to detect sources composed
of a small number of events, but they need to be fine-tuned
to properly take the background into account. Background
rejection is the most penalizing feature of topometric
methods. For this reason, in this paper we present, for the first time,
a method based on the DBSCAN algorithm (Ester et al.
1996). The DBSCAN is a topometric algorithm used to cluster
spatial data that are affected by background noise. Compared to
other topometric methods, it has the advantage of embedding in
the algorithm itself the discrimination between signal (cluster)
and background (noise), according to the local density of events
within a typical scanning brush, i.e. within a given scanning area.

The aim of the present paper is to show the potential of
the method, and its statistical characterization when applied to
astrophysical γ-ray data. We apply this method to the detection
of point-like sources in the Fermi-LAT data. We explore a large
volume of the γ-ray DBSCAN parameter space by means of
simulated data, and we provide a statistical characterization of the
γ-ray DBSCAN, finding signatures that differentiate purely random
fields from fields with sources. We define a significance
level for the detected clusters, and we successfully test this
significance with our simulated data. We apply the method to real
Fermi-LAT γ-ray data, and we find excellent agreement with
the results obtained with simulated data.



In a companion paper (Tramacere 2012), we will apply the
method to the Fermi-LAT sky, investigating specific issues related
to the Fermi-LAT response functions and showing the potential
for the discovery of new sources, in particular small
clusters located at high galactic latitude, or clusters on the galactic
plane affected by a strong background.

The paper is organized as follows. In Sec. 2 we describe the
logic of the DBSCAN method, and we present the algorithm
implemented to analyse γ-ray data, the γ-ray DBSCAN. In Sec. 3
we discuss some caveats regarding the application of the γ-ray
DBSCAN algorithm to γ-ray data. In Sec. 4 we study the statistical
properties of the γ-ray DBSCAN detection, using a simulated
test field with only noise, and five simulated test fields with
noise plus point-like sources. In Sec. 5 we evaluate the detection
performance of the method in terms of positional accuracy,
cluster reconstruction, and rejection of spurious clusters. In Sec. 6
we investigate the significance of the clusters, and describe
our algorithmic implementation. In Sec. 7 we finally use our
method with real Fermi-LAT data, investigating the detection
performance, and comparing the γ-ray DBSCAN cluster
significance to that returned by the Maximum Likelihood method
with standard Fermi-LAT software¹. In Sec. 8 we present
our conclusions, and we discuss future developments and
applications.

2. The γ-ray DBSCAN algorithm

The DBSCAN (Ester et al. 1996) is a topometric algorithm used
to cluster spatial data that are affected by background noise.
We have modified the original DBSCAN algorithm to adapt it
to our study. Our algorithm is mainly built
upon the following criteria:

1. Given a list of photons $D$, where each element is a tuple
storing positional sky coordinates, let $\rho(p_k, p_l)$ be the angular
distance between two photons $p_k$ and $p_l$.

2. We iterate over the full photon list $D$. A seed cluster $C^*_m$ is
built when a minimum number of photons $K + 1$ is enclosed
within a circle of radius $\varepsilon$ centered on $p_i$.

3. For each photon $p_l \in C^*_m$, we build the photon list $C^l_m$
by collecting all the photons $p_k$ satisfying the condition
$\rho(p_l, p_k) < \varepsilon$ and $p_k \notin C^*_m$.

4. For each photon $p_j \in C^l_m$, if the number of photons enclosed
within a circle of radius $\varepsilon$ centered on $p_j$ is $< K$ and $p_j \notin C^*_m$,
then $p_j$ is attached to the final photon list of the cluster
without a recursive search for further neighbours. These
points are defined density-reachable.

5. For each photon $p_j \in C^l_m$, if the number of photons enclosed
within a circle of radius $\varepsilon$ centered on $p_j$ is $> K$ and $p_j \notin C^*_m$,
$p_j$ is attached to $C^*_m$ and step 3 is repeated
recursively.

6. When both conditions at steps 4 and 5 are false, the cluster
$C_m$ is built by joining the density-reachable events to those
in the $C^*_m$ and $C^l_m$ lists.

7. The process starts again from step 1, searching for new clusters
and skipping all the events already flagged as noise or cluster,
until all the events in $D$ are flagged as cluster, noise,
or density-reachable events.

8. At the end of the process the full photon list will be
partitioned as follows:

$$D = D_{\rm cls} \cup D_{\rm noise} = \Big\{p_i \in \bigcup_m C_m\Big\} \cup \Big\{p_i \notin \bigcup_m C_m\Big\} \qquad (1)$$

In this way densely populated areas are classified as clusters
(sources), whereas sparsely populated areas are classified
as noise (background). The recursive call of step 3 is not
implemented in the original DBSCAN algorithm, and represents a
novelty. This new feature makes it possible to reconstruct clusters with a
size significantly larger than the $\varepsilon$ radius, making it unlikely
that a single cluster is fragmented into small satellite clusters.
Moreover, it makes it possible to reconstruct extended structures,
in particular extended sources, or filamentary structures in
the background.

¹ http://fermi.gsfc.nasa.gov/ssc/data/analysis/scitools/overview.html
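As an illustration of the steps listed in this section, the following minimal Python sketch grows clusters by density-based region growing. It is not the authors' implementation: the function names (`ang_dist`, `neighbours`, `gamma_dbscan`), the brute-force neighbour search, and the convention that the photon at the centre of the circle counts towards the $K + 1$ threshold are all assumptions made for this example.

```python
import math

NOISE = -1  # label used for background photons (assumed convention)

def ang_dist(p, q):
    """Angular distance in degrees between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def neighbours(photons, i, eps):
    """Indices of all photons other than i lying closer than eps to photon i."""
    return [k for k in range(len(photons))
            if k != i and ang_dist(photons[i], photons[k]) < eps]

def gamma_dbscan(photons, eps, K):
    """Return one label per photon: a cluster ID (0, 1, ...) or NOISE."""
    labels = [None] * len(photons)
    cluster_id = 0
    for i in range(len(photons)):
        if labels[i] is not None:
            continue
        seed = neighbours(photons, i, eps)
        if len(seed) + 1 <= K:       # fewer than K + 1 photons inside the circle
            labels[i] = NOISE
            continue
        labels[i] = cluster_id       # photon i starts a seed cluster
        queue = list(seed)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # density-reachable: attach, do not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nb = neighbours(photons, j, eps)
            if len(nb) + 1 > K:          # dense enough: keep growing from here
                queue.extend(nb)
        cluster_id += 1
    return labels
```

With these conventions a photon seeds or extends a cluster when at least $K$ other photons lie within $\varepsilon$ of it; photons reached from a cluster member but lacking enough neighbours are attached without further expansion.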

After the clustering process, each photon in $D$ is described
by a tuple storing: the photon position (both in galactic
and celestial coordinates), the photon class type (noise or cluster),
and the ID of the cluster the photon belongs to. Each cluster
$C_m$ is described by a tuple storing the position of the centroid
with its positional error, the ellipse of the cluster containment,
the cluster effective radius ($r_{\rm eff}$), and the number of photons
in the cluster ($N_p$). The ellipse of the cluster containment is
defined by the major and minor semi-axes ($\sigma_x$ and $\sigma_y$, respectively),
and by the inclination angle ($\sigma_{\rm alpha}$) of the major semi-axis w.r.t.
the latitudinal coordinate ($b$ or DEC). To evaluate the ellipse
axes we use the Principal Component Analysis (PCA) method
(Jolliffe 1986). This method uses the eigenvalue decomposition
of the covariance matrix of the two position arrays $x$ and $y$.
By definition, the square root of the first eigenvalue corresponds
to $\sigma_x$, and that of the second to $\sigma_y$. The axes represent the two
orthogonal directions of maximum variance of the cluster. The
effective radius is defined as $r_{\rm eff} = \sqrt{\sigma_x^2 + \sigma_y^2}$. To find the
centroid of the cluster and its uncertainty, we use a weighted average
of the position of each photon in $C_m$, as follows:

- We define the first-order centroid ($C_{\rm ave}$) as the average of the
position of each cluster photon: $C_{\rm ave} = (\langle x \rangle, \langle y \rangle)$.

- We define the weight array according to the distance between
$p_k \in C_m$ and $C_{\rm ave}$: $w_k = 1/\rho(p_k, C_{\rm ave})$.

- The cluster centroid $C_{\rm ctr}$ results from the average of the
position of each cluster point, weighted by $w_k$.

- The centroid position uncertainty ($pos_{\rm err}$) is determined by
propagating the error on the weighted average $C_{\rm ctr}$. We
have numerically verified that $pos_{\rm err}$ corresponds to a ≈ 95%
positional uncertainty.
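The cluster shape and centroid described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: positions are treated as flat $(x, y)$ coordinates with Euclidean distances instead of the angular distance $\rho$, the sample covariance uses the $n - 1$ normalization, and the `floor` argument (not in the text) protects the $1/\rho$ weight when a photon coincides with the first-order centroid. The function names are hypothetical, and the error propagation for $pos_{\rm err}$ is not reproduced.

```python
import math

def cluster_shape(xs, ys):
    """PCA of the 2x2 covariance matrix; returns (sigma_major, sigma_minor, r_eff)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = 0.5 * (sxx + syy)
    root = math.sqrt((0.5 * (sxx - syy)) ** 2 + sxy ** 2)
    lam1 = half_trace + root
    lam2 = max(0.0, half_trace - root)   # guard against tiny negative round-off
    sig_major, sig_minor = math.sqrt(lam1), math.sqrt(lam2)
    r_eff = math.sqrt(sig_major ** 2 + sig_minor ** 2)
    return sig_major, sig_minor, r_eff

def weighted_centroid(xs, ys, floor=1e-6):
    """Centroid weighted by the inverse distance to the plain average position."""
    n = len(xs)
    cx, cy = sum(xs) / n, sum(ys) / n    # first-order centroid C_ave
    w = [1.0 / max(floor, math.hypot(x - cx, y - cy)) for x, y in zip(xs, ys)]
    wsum = sum(w)
    return (sum(wi * x for wi, x in zip(w, xs)) / wsum,
            sum(wi * y for wi, y in zip(w, ys)) / wsum)
```

The inverse-distance weights pull the centroid towards the densest part of the cluster, so outlying photons have less influence than in a plain average.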



3. Caveat on the application to γ-ray data

The application of clustering methods such as the γ-ray
DBSCAN involves practical difficulties, related mostly
to the instrument PSF, and to gradients and/or structures in the
background. In order to deal with these issues without biasing
the detection results, we recommend applying the criteria
discussed in the following.

Fig. 1. Photon map for the sky test field 1, with the result of the γ-ray DBSCAN detection for $K = 5$ and $\varepsilon = 0.17$ deg. The blue
crosses refer to the simulated sources, the green boxes to the 51 detected true clusters, and the red boxes to the 2 fake ones. The black
dots represent the background events; the remaining colors indicate cluster events.

Fig. 2. A close-up of two true clusters reported in Fig. 1. The
ellipses correspond to the ellipse of the cluster containment. The
purple and orange points represent the cluster points, the black
dots represent the background events, the blue crosses the position
of the simulated sources, and the green boxes the position of the
cluster centroids.

First, we comment on the PSF impact. The PSF imposes a
limit on the capability of an instrument to resolve sources separated
by a distance smaller than the PSF size. Sources with sizes
below the PSF are classified as point-like; otherwise they are classified
as extended. A further complication is that the PSF often depends
on the energy; in the case of Fermi-LAT, the 68% containment
angle of the reconstructed incoming photon direction, for
normal-incidence photons, has a typical size of a couple of degrees
at 100 MeV (Fermi-LAT Collaboration 2012), and scales
down to a few tenths of a degree above GeV energies². The
size of the PSF is strongly connected to the size $\varepsilon$ of the γ-ray
DBSCAN scanning brush. Indeed, if $\varepsilon$ is much smaller than the
PSF size, there is a risk of losing clusters characterized
by small $N_p$, or of fragmenting a cluster with large $N_p$ into smaller
fake satellite clusters. We stress that the formation of satellite
clusters is a very rare event, thanks to our recursive DBSCAN
implementation, explained in Sec. 2. On the contrary, if
$\varepsilon$ is much larger than the PSF, it is likely that extended clusters
are built, contaminated by the background or by nearby sources.

² http://www.slac.stanford.edu/exp/glast/groups/canda/lat_Performance.html



A careful and self-consistent analysis of the effects of the energy
dependence of the PSF, and in general of issues related to
the Fermi-LAT response function, is beyond the scope of this
paper, where we focus mostly on a statistical characterization of
the method. These subjects will be investigated in the companion
paper (Tramacere 2012).



A second relevant issue is the inhomogeneity of the background,
which affects the choice of both $\varepsilon$ and $K$. If the background
is homogeneous over the entire field, the optimal choice
of a single pair of values of $\varepsilon$ and $K$ guarantees a safe rejection
of the background. Indeed, values of $\varepsilon$ and $K$ such that the average
density of photons within $\varepsilon$ is significantly larger than the
average density of the background photons make it unlikely
for a cluster to grow from a background fluctuation. Unfortunately,
the γ-ray sky shows strong background gradients, in particular
at low galactic latitudes. To solve this issue, one could
adapt the values of $\varepsilon$ and $K$ according to a local value
of the background photon density. Since $\varepsilon$ is strongly constrained
by the PSF, one should tune mostly the value of
$K$. The drawback is that, as we increase the value of $K$ to compensate
for the background, we decrease the capability to detect
clusters with small $N_p$. To overcome this difficulty, we adopt an
alternative solution. We use a single pair of values of $\varepsilon$ and
$K$ for each field, where $\varepsilon$ is mostly constrained by the PSF
and $K$ by the field average background, and we take into account
the background inhomogeneities by defining a significance level
of the cluster, according to the signal-to-noise ratio (Li & Ma
1983) evaluated from the local background. This is explained
in detail in Sec. 6. The capability to reject clusters with
a low significance level makes it possible to relax the constraints on
$\varepsilon$ and $K$, increasing the number of clusters detected, and hence
the detection ratio, while the significance threshold still rejects
spurious sources. However, when
the background is so high that fluctuations in
the background events can lead to densities comparable to those
of weak sources, we recommend applying a cut in energy to
make this possibility rare. In order to optimize the ratio between
background and cluster events, in the following we use a threshold
energy of 3 GeV, which mitigates the possible bias due to
background fluctuations.



4. Statistical properties of the γ-ray DBSCAN clusters

4.1. The test fields

In this section we study the statistical properties of the clusters,
looking for signatures that characterize random Poissonian
fields and fields with point-like sources. To accomplish this task
we compare results obtained for a test field with only noise (random
test field), and five test fields with noise plus point-like
sources (sky test fields 1-5).

As sky test fields we use the same fields used in Campana
et al. (2012). Each of these five sky fields covers a broad sky region,
with a galactic longitude extension of 80° < l < 170°, and
a galactic latitude extension of 40° < b < 65°. The γ-ray background
has been simulated using the standard gtobssim³ tool,
developed by the Fermi-LAT collaboration, simulating both the
Galactic and isotropic components for a 2-year long period, using
a threshold energy of 3 GeV, for a total of 9322 photons.
To this photon list we added 70 simulated sources: for
each source, the number of photons was chosen from a probability
distribution given by a power law with exponent 2, from
a minimum value of 4 up to 40 photons, joined to a constant
tail up to 240 photons. The number of sources is similar to
that reported in the Fermi-LAT Second Source Catalog (Nolan
et al. 2012, 2FGL hereafter) in the same region of the sky. The
source events are spatially distributed with a bivariate Gaussian
probability density function (PDF) with $\sigma_x^{\rm sim} = \sigma_y^{\rm sim} = 0.2$ deg,
centered at the source location. Five simulated test fields have
been generated, adding the simulated sources to the diffuse background.
The only difference among the five realizations is the source
location, randomly chosen to have different brightness contrast
between sources and the background. The random test field covers
the same area as the sky test fields, with a number of events
equal to that of the sky test field 1 (background and sources), for a total
of 11044 events.

In Fig. 1 we show the photon map for the sky test field 1,
and the result of the γ-ray DBSCAN detection for $K = 5$ and
$\varepsilon = 0.17$ deg. We detect 51 true clusters, and only 2 fake ones.
A cluster is defined true if the position of the simulated source
falls within a circle centered on the cluster centroid, with a radius
equal to $2\,pos_{\rm err}$. We call the remaining clusters fake. In Fig. 2
we show a close-up of two true clusters. The black ellipses
correspond to the ellipses of the positional error, and the purple
and orange thick points represent the cluster points, while the
black thick dots represent the background.

4.2. Test Strategy

We want to investigate the statistical properties of the γ-ray
DBSCAN clusters, in particular signatures that distinguish
purely random fields from fields with point-like sources, and
their dependence on $K$ and $\varepsilon$. To investigate systematically a
broad volume of the parameter space, we use a parametric approach.
We set the range of $\varepsilon$ in [0.10, 0.50] deg with a step of
0.01 deg, and the range of $K$ in [2, 15] with a step of 1. The
total number of detection trials for each test field is 574. We collect
the statistics of the trials, and we investigate the distributions
of $r_{\rm eff}$ and $N_p$, and their connection with $\varepsilon$ and $K$, respectively.
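The parameter scan of Sec. 4.2 can be enumerated as follows. This is a minimal sketch of the grid only, with hypothetical variable names; the detection run performed on each pair is not shown.

```python
# eps from 0.10 to 0.50 deg in steps of 0.01 deg; built from integers
# so that repeated float additions do not drift off the grid.
eps_values = [i / 100 for i in range(10, 51)]
# K from 2 to 15 in steps of 1.
k_values = list(range(2, 16))

# One detection trial per (K, eps) pair.
grid = [(k, eps) for k in k_values for eps in eps_values]
print(len(eps_values), len(k_values), len(grid))  # 41 14 574
```

The product of 41 values of $\varepsilon$ and 14 values of $K$ reproduces the 574 detection trials per test field quoted in the text.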



4.3. Statistics of $r_{\rm eff}$ and connection with $\varepsilon$

We start by investigating the distribution of the $\log_{10}(r_{\rm eff})$ values
in the case of the random test field and of the sky test field 1. For the
random test field, the distribution of the detections collected over the full
$K$-$\varepsilon$ parameter space (top left panel of Fig. 3) shows a symmetric shape
well fitted by a Gaussian distribution (log-normal w.r.t. $r_{\rm eff}$),
with a mean value of $\langle \log_{10}(r_{\rm eff}) \rangle \simeq -0.45$ (corresponding
to $\langle r_{\rm eff} \rangle \simeq 0.3$ deg) and a dispersion of $\sigma_{\log_{10}(r_{\rm eff})} = 0.23$.
The log-normal distribution provides a reasonable description
of the empirical distributions also for individual pairs
of $(K, \varepsilon)$ values. An example is given in panel b of Fig. 3 for
the case $K = 3$, $\varepsilon = 0.3$ deg, where the best fit values are
$\langle \log_{10}(r_{\rm eff}) \rangle \simeq -0.51$ and $\sigma_{\log_{10}(r_{\rm eff})} = 0.16$. We now investigate
the empirical distribution of $\log_{10}(r_{\rm eff})$ for fields with point-like
sources. In the right panel of Fig. 3 we show the case of the
sky test field 1. The distributions of $\log_{10}(r_{\rm eff})$ are still described
by a normal distribution. In the case of fake clusters (red dashed line),
the best fit values of the mean ($\langle \log_{10}(r_{\rm eff}) \rangle \simeq -0.46$) and of
the dispersion ($\sigma_{\log_{10}(r_{\rm eff})} = 0.24$) are very similar to those found
in the case of the random test field. On the contrary, the true cluster
distribution (blue hatched histogram) peaks around
$\log_{10}(r_{\rm eff}) \simeq -0.67$, corresponding to $r_{\rm eff} \simeq 0.21$
deg, very close to the value of the dispersion $\sigma^{\rm sim} = 0.20$ deg
used to simulate the sources. Since the simulation parameter
$\sigma^{\rm sim}$ reproduces the effect of the instrumental PSF, we observe
that for non-random fields the typical size of the reconstructed
clusters is constrained by the PSF, suggesting the empirical rule
of setting the value of $\varepsilon$ to the order of the PSF size.

To investigate more accurately the connection between $\varepsilon$ and
the PSF, we analyse the statistical properties of the quantity
$r_{\rm eff}/\varepsilon$ as a function of $\varepsilon$. For each value of $\varepsilon$, we determine the
median, and the two-sided 1-σ confidence level (CL) interval
around the median, of the $r_{\rm eff}/\varepsilon$ distributions. In the left panel
of Fig. 4 we plot the $r_{\rm eff}/\varepsilon$ median (blue solid circles) and 1-σ
CL region as a function of $\varepsilon$, for the random field. We note that
the $r_{\rm eff}/\varepsilon$ trend is slightly increasing with $\varepsilon$, and that the 1-σ
CL region is consistent with the case $r_{\rm eff}/\varepsilon = 1$, but the upper
boundary shows a systematic increase, compared to the lower
boundary, for $\varepsilon > 0.30$ deg. The trend for the case of the true
clusters in sky test field 1 (right panel of Fig. 4) shows a different
behaviour. The median of $r_{\rm eff}/\varepsilon$ (red solid circles) is slightly
decreasing with $\varepsilon$, showing that, for true clusters, $r_{\rm eff}$ is not
sensitive to the size of $\varepsilon$, being mostly constrained by the simulated
PSF size. As expected, for the case of fake clusters (blue
dashed line), the trend is almost identical to that of the clusters
in the random field.
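The per-$\varepsilon$ summary used in this section (median and two-sided 1-σ interval of $r_{\rm eff}/\varepsilon$) can be sketched as below. The text does not say how the interval is computed; this example assumes the 15.87th and 84.13th percentiles with linear interpolation, and the function names are hypothetical.

```python
from collections import defaultdict

def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of an already sorted list."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def ratio_summary(clusters):
    """clusters: iterable of (eps, r_eff) pairs, one per detected cluster.

    Returns {eps: (lower, median, upper)} for the r_eff / eps distribution.
    """
    by_eps = defaultdict(list)
    for eps, r_eff in clusters:
        by_eps[eps].append(r_eff / eps)
    summary = {}
    for eps, ratios in by_eps.items():
        ratios.sort()
        summary[eps] = tuple(percentile(ratios, q) for q in (15.87, 50.0, 84.13))
    return summary
```

Using percentiles rather than a mean and standard deviation keeps the summary meaningful for the skewed distributions discussed above.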



4.4. Statistics of $N_p$ and connection with $K$



³ http://fermi.gsfc.nasa.gov/ssc/data/analysis/scitools/help/gtobssim.txt



Fig. 3. Panel a: distribution of the values of $\log_{10} r_{\rm eff}$ for the random field case, for the full parameter space (black line), and fit by
means of a Gaussian distribution (blue line). Panel b: the same as in panel a, for the case of $K = 3$ and $\varepsilon = 0.3$ deg. Panel c:
distribution of $\log_{10} r_{\rm eff}$ for the case of the sky test field 1, for fake clusters (red solid line) and true clusters (blue solid line, hatched
histogram); the dashed lines represent a Gaussian best fit.

Fig. 4. Left panel: the $r_{\rm eff}/\varepsilon$ statistical distribution as a function of $\varepsilon$, for the random field case. The blue solid circles represent
the median, and the grey shaded area represents the 1-σ confidence level region, for each value of $\varepsilon$. Right panel: the same as in the
left panel, for the case of the sky test field 1. The red solid circles represent the median of the true clusters case, and the grey
area the 1-σ confidence level region. The dashed line shows the 1-σ confidence level region for the case of the fake clusters.

We now investigate the statistics of the distribution of the number
of photons per cluster. In the case of random fields, we expect
the number of photons in a cluster to follow a Poisson
distribution. Indeed, for a generic two-dimensional Poisson process,
the probability of observing a number of events $N(S) = j$
enclosed by a surface $S$ is given by:

$$P(N(S) = j) = \frac{(\lambda S)^j e^{-\lambda S}}{j!}, \qquad (2)$$

where $\lambda$ is the average spatial density. Expressing $S$ in terms of
$\varepsilon$, $S = \pi \varepsilon^2$, we can rewrite:

$$P(N(\varepsilon) = j) = \frac{(\lambda \pi \varepsilon^2)^j e^{-\lambda \pi \varepsilon^2}}{j!}, \qquad (3)$$

from which it follows that, given the values of $K$ and $\varepsilon$, the probability
of finding a cluster as a function of $K$ and $\varepsilon$ is given
by

$$P_{\rm cls}(\varepsilon, K) = P(N(\varepsilon) > K) = 1 - \sum_{j=0}^{K} \frac{(\lambda \pi \varepsilon^2)^j e^{-\lambda \pi \varepsilon^2}}{j!}, \qquad (4)$$

namely the Poissonian survival function. However, due to the
logic of the DBSCAN clustering process, the Poisson statistics
cannot be extended from $\varepsilon$ to $r_{\rm eff}$ for any value of $\varepsilon$. Indeed,
a cluster is not a simple collection of points enclosed within a
surface $S$; this holds only within the $\varepsilon$-sized circle, namely the
seed of the cluster ($C^*$). If we consider the annulus defined between
$\varepsilon$ and the cluster radius $r_{\rm cls}$, not all the points in the
annulus will be cluster members, but only those that are at least
density-reachable. This implies that we expect a deviation from
the Poisson statistics when $r_{\rm eff}$ is significantly larger than $\varepsilon$, i.e.
for $\varepsilon > 0.3$ deg (according to the analysis presented in the previous
section). This expected deviation from the Poissonian statistics
is confirmed by the plots in the left panels of Fig. 5. In panel a we
show the distribution of $N_p$ for the case $K = 2$ and $\varepsilon = 0.20$ deg.
We note that the Poisson distribution (Eq. 3) gives a reasonable
description of the empirical distribution. On the contrary, for the
case of $\varepsilon = 0.30$ deg (panel b), we observe that the Poisson distribution
shows larger deviations, in particular for $K > 6$. When
we take into account the $N_p$ distribution for the full parameter
space (panel c), we note that the Poissonian distribution fails to
provide a reasonable description of the empirical distribution,
whilst a log-normal one gives a good fit.
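To make the Poissonian survival function of Eq. 4 concrete, the sketch below evaluates it for a uniform random field. The function names are hypothetical, and the usage lines assume that the 11044 events of the random test field are spread uniformly over the 80° < l < 170°, 40° < b < 65° region, which ignores any gradient of the simulated background.

```python
import math

def poisson_pmf(j, mu):
    """Probability of exactly j events for a Poisson variable of mean mu."""
    return math.exp(-mu) * mu ** j / math.factorial(j)

def cluster_seed_probability(density, eps, K):
    """P(N(eps) > K) for a Poisson field with `density` photons per square degree."""
    mu = density * math.pi * eps ** 2      # expected photons in the eps circle
    return 1.0 - sum(poisson_pmf(j, mu) for j in range(K + 1))

# Illustrative use: solid angle of the test region in square degrees,
# area = (l2 - l1) * (sin b2 - sin b1), with the sine difference converted to degrees.
area_deg2 = 90.0 * math.degrees(math.sin(math.radians(65.0))
                                - math.sin(math.radians(40.0)))
density = 11044 / area_deg2
for K in (2, 5, 10):
    print(K, cluster_seed_probability(density, eps=0.2, K=K))
```

The probability drops quickly with $K$ at fixed $\varepsilon$, which is the behaviour exploited when $K$ is tuned against the average background.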

Fig. 5. Left panels: the distribution of $N_p$ for the random test field, for the case of $K = 2$, $\varepsilon = 0.20$ deg (panel a, red solid boxes).
The empty blue bars represent a Poissonian best fit. Panel b shows the case of $K = 2$, $\varepsilon = 0.30$ deg (purple solid triangles).
Panel c shows the case for the full $K$-$\varepsilon$ parameter space; the solid black line represents a log-normal best fit. Right panels: panel
d shows the distribution of $r_{\rm eff}^2$ (black solid line), and its best fit by means of a log-normal distribution (red dashed line). Panel e
shows the $N_p$ distribution for the fake clusters in the sky test field 1 (red solid circles), and the blue empty bars a Poissonian best fit.
Panel f shows the $N_p$ distribution for the true clusters in the sky test field 1 (blue hatched histogram), the log-normal best fit (red
dashed line), and the Poissonian fit (solid black line).

The log-normal trend of $N_p$ is consistent with the log-normal
trend of the distribution of $r_{\rm eff}$. Since the number of photons in
a cluster will be approximately $N_p \propto \lambda r_{\rm eff}^2$, we can write the
PDF of $N_p$ as

$$f(N_p) \propto f(r_{\rm eff}^2)\,\lambda. \qquad (5)$$



To evaluate the distribution of $r_{\rm eff}^2$, we can use the standard
theory of the transformation of random variables (RV)
(Papoulis 1965). It can easily be proved that, given a RV $X$ having
a log-normal distribution,

$$f(X) = \frac{1}{X \sigma \sqrt{2\pi}} \exp\left[-\frac{(\ln X - \mu)^2}{2\sigma^2}\right], \qquad (6)$$

the RV $Y = X^2$ will follow a log-normal distribution given by:

$$f(Y) = \frac{1}{Y\, 2\sigma \sqrt{2\pi}} \exp\left[-\frac{(\ln Y - 2\mu)^2}{2(2\sigma)^2}\right]. \qquad (7)$$
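The log-normal form of the distribution of $Y = X^2$ follows from a change of variable. A short derivation, written under the usual assumption $X > 0$ and with $f_X$, $f_Y$ denoting the two densities, is:

```latex
\begin{align}
f_Y(y) &= f_X(\sqrt{y}) \left| \frac{d\sqrt{y}}{dy} \right|
        = \frac{1}{\sqrt{y}\,\sigma\sqrt{2\pi}}
          \exp\left[-\frac{(\ln\sqrt{y} - \mu)^2}{2\sigma^2}\right]
          \frac{1}{2\sqrt{y}} \\
       &= \frac{1}{y\,(2\sigma)\sqrt{2\pi}}
          \exp\left[-\frac{(\ln y - 2\mu)^2}{2(2\sigma)^2}\right]
\end{align}
```

The second line uses $\ln\sqrt{y} - \mu = (\ln y - 2\mu)/2$, so $\ln Y$ is normally distributed with mean $2\mu$ and standard deviation $2\sigma$.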



Indeed, our $r_{\rm eff}^2$ distribution for the random field (panel d of
Fig. 5) is fitted by a log-normal distribution peaking at ≈ 0.03
deg². Hence, according to Eq. 5, we expect that $f(N_p)$ will also
follow a log-normal distribution when $N_p$ is not ruled by
Poissonian statistics.



We verify that the same statistical trends describe the
sky test fields. Panels e and f in Fig. 5 show the statistical distribution
of $N_p$ for the sky test field 1 case. In agreement with
the analysis concerning the random test field, we see that the
fake clusters ($\varepsilon = 0.30$ deg, panel e in Fig. 5) are described
by Poissonian statistics, whilst the true clusters (panel f in Fig. 5)
are better described by a log-normal distribution (red dashed
line) than by a Poissonian one (solid black line). We also
observe that the log-normal law describes the empirical distribution
reasonably well only for values of $N_p < 50$, whilst it shows
a significant deviation in the tail, consistent with the statistics of
our simulated source population.

Fig. 6. Panel a: number of detected clusters for the random test field case (blue solid points), as a function of $K$, and best fit by
means of a Poissonian survival function (red empty bars). Panel b: the $N_p$ statistical distribution as a function of $K$, for the random
field case. The blue solid circles represent the median, and the grey shaded area represents the 1-σ confidence level region around
the median, for each value of $K$. The dashed black line represents the $N_p = K + 1$ law. Panel c: number of detected clusters for the
sky test field 1 case (black solid points), for the case of fake clusters, as a function of $K$, and best fit by means of a Poissonian
survival function (red empty bars). Panel d: number of detected clusters for the sky test field 1 case (red solid boxes), for the case of
true clusters, as a function of $K$, and best fit by means of a Poissonian survival function (black empty boxes). Panel e: same as in
panel b, for the sky test field 1 case.

To complete this statistical characterization, we investigate
the distribution of the number of detected clusters as a function
of the threshold $K$. According to Eq. 4, we expect that the
number of detected clusters, for a random field, follows a Poisson
survival distribution. Panel a of Fig. 6 confirms our hypothesis:
the Poisson survival function provides a reasonable
description of the empirical distribution. The same holds for the
case of fake clusters of the sky test field 1 (panel c of Fig. 6). On the
contrary, in the case of true clusters (panel d of Fig. 6), the Poisson
survival distribution is not able to reproduce the observed trend,
consistently with the non-Poissonian statistics of the simulated
clusters. Panels b and e of Fig. 6 show the 1-σ CL region
for $N_p$ as a function of $K$. We note that, both in the case
of the random field and of the sky field true clusters, the lower boundary of the
region is constrained by the equation $N_p = K + 1$, which is consistent
with the γ-ray DBSCAN logic. On the contrary, the upper
boundary shows a different behaviour. In the case of the random
field, the upper boundary deviates from the lower boundary
compatibly with the fluctuations of the events around the $\varepsilon$ circle,
and ranges from about 8 to about 16. In the
case of the sky field true clusters, the upper boundary is constrained by the
statistics of the number of events in the simulated sources, and
ranges from about 60 to 100.

5. Testing the detection performance with
simulated γ-ray data

In this section we investigate the detection performance of the
γ-ray DBSCAN. First, we study how $K$ and $\varepsilon$ affect the
detection efficiency and the spurious
ratio. Then, we investigate the
capability of the algorithm to reconstruct the simulated clusters,
and the positional accuracy of the reconstructed centroids. We
test the detection performance of the γ-ray DBSCAN using as
benchmark the five sky test fields used in the previous section,
and exploring the same parameter space.

5.1. Detection efficiency and spurious ratio as a function of
$K$ and $\varepsilon$

To investigate the detection performance of the γ-ray DBSCAN,
we run a γ-ray DBSCAN detection for each of the five sky test fields
and for each pair of values $(K, \varepsilon)$. For each detection run,
we build a cluster catalog. Starting from the cluster catalog, we
build the corresponding candidate catalog. The candidate catalog
is a list of sources built by taking into account two possible
biases, confusion and multiple association. In detail:

- A cluster is defined true, i.e. with a possible counterpart, if
the position of the simulated source falls within a circle
centered on the cluster centroid, with a radius equal to 2 pos_err.

- Two or more true clusters are defined confused if they have
the same counterpart.

- A true cluster has a multiple association if it has more than
one counterpart.

We stress that the number of confused clusters is negligible:
the average number of confused clusters per run is about 0.08,
and no confused clusters are found for K > 4. The average
number of multiple associations per run is about 0.2.
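
The bookkeeping behind the true, confused, and multiple-association labels can be sketched as follows. This is an illustration only: it assumes flat-sky separations, it takes the matching radius of each cluster as an input, and the function name `associate` and the tuple layout are hypothetical.

```python
import math


def associate(clusters, sources):
    """Match cluster centroids to simulated source positions.

    clusters: list of (x, y, match_radius) in degrees (flat-sky assumption).
    sources:  list of (x, y) in degrees.
    Returns (matches, confused, multiple):
      matches[i] -> indices of the sources inside the circle of cluster i,
      confused   -> indices of true clusters sharing a counterpart,
      multiple   -> indices of clusters with more than one counterpart.
    """
    matches = []
    for cx, cy, radius in clusters:
        inside = [s for s, (sx, sy) in enumerate(sources)
                  if math.hypot(sx - cx, sy - cy) <= radius]
        matches.append(inside)

    # Invert the mapping: which clusters claim each source?
    claimed_by = {}
    for c, inside in enumerate(matches):
        for s in inside:
            claimed_by.setdefault(s, []).append(c)

    confused = sorted({c for cl in claimed_by.values() if len(cl) > 1 for c in cl})
    multiple = [c for c, inside in enumerate(matches) if len(inside) > 1]
    return matches, confused, multiple


# Toy example: clusters 0 and 1 both sit on source 0; cluster 2 has no counterpart.
clusters = [(10.00, 50.00, 0.05), (10.02, 50.00, 0.05), (12.00, 51.00, 0.05)]
sources = [(10.01, 50.00), (11.00, 50.50)]
matches, confused, multiple = associate(clusters, sources)
```

In this toy case the first two clusters are true and confused, the third is a candidate spurious cluster, and no cluster has a multiple association.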

The final candidate catalog counts a number of candidate
sources N_src, each identified by a unique SRC id. The number of
spurious sources is N_fake = N_src - N_true. In order to
characterize the performance, we define the following parameters:

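- the detection efficiency:

$$
D_{\rm eff} =
\begin{cases}
\dfrac{N_{\rm true} - N_{\rm fake}}{N_{\rm sim}(N_p^{\rm sim} > K)}, & \text{if } (N_{\rm true} - N_{\rm fake}) \le N_{\rm sim}(N_p^{\rm sim} > K) \\
1.0, & \text{if } (N_{\rm true} - N_{\rm fake}) > N_{\rm sim}(N_p^{\rm sim} > K)
\end{cases}
\qquad (8)
$$

where N_sim(N_p^sim > K) is the number of simulated sources
with a number of simulated events larger than K;

- the true detection ratio D_true = N_true / N_src;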

Fig. 7. Isolevel maps for D_fake (panel a), D_true (panel b), D_eff (panel c), and Q (panel d), for the sky test field 1. The white lines
show the isolevel = 0, the black lines show the isolevel = 0.68, and the blue lines show the isolevel = 0.95.



- the spurious detection ratio D_fake = N_fake / N_src;

- the overall detection quality factor Q, which takes into
account the tradeoff between D_eff and D_fake, defined as:

$$
Q = D_{\rm eff} \, (1 - D_{\rm fake}) \qquad (9)
$$

The D_eff parameter gives the fraction of simulated clusters
above the threshold N_p^sim > K that are detected by the
method, net of the fake ones. Hence, it does not provide an
indication of the spurious contamination. For this reason we
have introduced the Q parameter, which rescales D_eff according
to the ratio between the fake clusters and the found clusters
N_src. We recall that, before the cap applied in Eq. (8), the
ratio (N_true - N_fake)/N_sim(N_p^sim > K) can exceed 1.0.
Consider a simulated cluster such that, for a given K and ε, the
corresponding seed cluster has a size N_p = N_p^sim = K. With no
background events within the circle of radius ε, this cluster is
rejected. If one or more background events fall within the
circle of radius ε, i.e. N_p > K, the cluster is detected,
although it is not counted in N_sim(N_p^sim > K). For this
reason, in such a case, we report a value of D_eff = 1.0. The
same applies to Q.
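
The performance parameters can be computed with a few lines of code. The sketch below is illustrative: the function name `detection_metrics` is hypothetical, and the form Q = D_eff (1 - D_fake) is an assumption that reproduces the D_eff and Q columns of Table 2 to two decimals. For the real-sky rows of Table 2 the number of catalog sources (35) is used in place of N_sim(N_p^sim > K).

```python
def detection_metrics(n_true, n_fake, n_sim_cut):
    """Return (D_eff, D_true, D_fake, Q) for one detection run.

    n_true:    candidate sources with a counterpart
    n_fake:    candidate sources without a counterpart
    n_sim_cut: simulated sources with more than K simulated events
    Assumes at least one candidate source (n_true + n_fake > 0).
    """
    n_src = n_true + n_fake
    d_true = n_true / n_src
    d_fake = n_fake / n_src
    # Detection efficiency: net true fraction, capped at 1.0.
    d_eff = min((n_true - n_fake) / n_sim_cut, 1.0)
    # Assumed form of Q: D_eff rescaled by the non-spurious fraction.
    q = d_eff * (1.0 - d_fake)
    return d_eff, d_true, d_fake, q


# Rows of Table 2 (real Fermi-LAT field, 35 catalog sources): (True, Fake).
rows = [(34, 0), (34, 1), (35, 2), (35, 4)]
results = [detection_metrics(n_true, n_fake, 35) for n_true, n_fake in rows]
for (n_true, n_fake), (d_eff, _, _, q) in zip(rows, results):
    print(f"True={n_true} Fake={n_fake} D_eff={d_eff:.2f} Q={q:.2f}")
```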



In Fig. 7 we summarize the detection runs for the sky test field
1, for the full parameter space with K > 2. Panel a shows the
isolevel map of the fake clusters detection ratio D_fake. The
gradient in the isolevel map is quite sharp, and roughly half of
the parameter space shows no fake clusters (white isolevel
line). To better understand the impact of fake clusters, it is
interesting to compare the D_fake isolevel map with the D_true
isolevel map (panel b of Fig. 7). Also in this case the map
shows a sharp gradient, and the region with D_true > 0.95
overlaps the D_fake = 0 region. These two maps clearly show the
region of the parameter space where the algorithm performs best,
but the D_true and D_fake ratios carry no information on the
ratio between the number of true detected clusters and the
number of simulated clusters. This information is provided by
the D_eff isolevel map (panel c of Fig. 7). To focus on the
"effective" volume of the parameter space, we mask with a white
area the region where D_eff < 0. We note that the isolevel lines
D_eff = 0 and the isolevel lines in the maximum gradient area
show a positive correlation between K and ε, meaning that an
increased value of ε requires an increased value of K to achieve
a better background rejection. To better evaluate the tradeoff
between D_true and D_fake, we plot in panel d of Fig. 7 the
isolevel map of Q. This plot shows that the area corresponding
to Q > 0.95 is consistent with that found in the case of D_eff.
In Tab. 1 we report the D_eff values obtained for all the five
sky fields, for detections with a number of fake sources < 6. We
note that the average number of true clusters ranges between 44
and 51, with the fake ones ranging between 1 and 3, and an
average D_eff between 0.96 and 1.0. This is a very promising
result.

Fig. 8. Panel a: red solid boxes show the mean positional error of the centroid, for true clusters in sky test field 1, and the standard
deviation (vertical error bar), vs. N_p. The clusters are binned in N_p, with the bin width indicated by the horizontal error bar. The
black solid circles represent the corresponding trend for the distance between the cluster centroid and the simulated source position.
Panel b: the distribution of the distance between the simulated source position and the cluster centroid, expressed in arcsec, for the
case of ε = 0.10 deg. (black line), ε = 0.15 deg. (blue line), and ε = 0.20 deg. (red line). Panel c: the cumulative distributions
corresponding to panel b.

Fig. 9. Top panel: the average number of photons associated
to each cluster N_p, and its dispersion (vertical bar), vs. the
number of simulated photons (N_p^sim). The red points refer to
the sub parameter space ε = 0.15 deg., and the solid blue circles
to the ε = 0.20 deg. sub space. The solid green line represents
the law N_p = N_p^sim. The dashed lines represent the law
N_p = N_p^sim ± 10. Bottom panel: the corresponding fractional
deviation (N_p - N_p^sim)/N_p^sim.

5.2. Cluster reconstruction, and positional accuracy 

The positional accuracy is probably the most important feature
of the topometric methods. In Sec. 2 we have described our
weighting method to reconstruct the centroid of the cluster.

Panel a of Fig. 8 shows by red solid boxes the mean positional
error of the cluster centroid and its standard deviation
(vertical error bar) vs. N_p, for the true clusters of the sky
test field 1 with ε < 0.30 deg. The clusters are binned in N_p,
with the bin width indicated by the horizontal error bar. As
expected, the uncertainty on the reconstructed cluster centroid
scales as pos_err ∝ σ_sim/√N_p (solid red line). The solid black
circles represent the corresponding trend for the separation
between the simulated cluster position and the reconstructed
cluster centroid. For N_p > 30, the separation is below 2'. In
panel b of Fig. 8 we plot the histogram of the angular
separation between the position of the simulated source and the
position of the cluster centroid. For the three cases ε = 0.10
deg., ε = 0.15 deg., and ε = 0.20 deg., the positional error is
below 1.5' for 68% of the sample.
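
The 1/√N_p scaling of the centroid uncertainty can be checked with a small Monte Carlo. The sketch below is not the weighting method of Sec. 2: it assumes an unweighted mean as centroid, a circular Gaussian PSF on a flat sky, and no background events. The function name and the value of `sigma_sim` are illustrative.

```python
import math
import random


def mean_centroid_offset(n_events, sigma, n_trials=1000, seed=1):
    """Mean radial offset of the centroid of n_events Gaussian-distributed events."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n_events)]
        ys = [rng.gauss(0.0, sigma) for _ in range(n_events)]
        # Unweighted centroid; the true source sits at the origin.
        total += math.hypot(sum(xs) / n_events, sum(ys) / n_events)
    return total / n_trials


sigma_sim = 0.2  # deg, illustrative PSF width
for n_p in (10, 40, 160):
    offset = mean_centroid_offset(n_p, sigma_sim)
    # Quadrupling n_p should roughly halve the offset (1/sqrt(N) scaling).
    print(n_p, round(offset * 60.0, 2), "arcmin")
```

Under these assumptions the mean radial offset is expected to follow sqrt(π/2) σ/√N_p, the mean of a Rayleigh distribution with scale σ/√N_p.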

Besides the positional accuracy, it is also important to
understand the capability of the γ-ray DBSCAN to reconstruct
the simulated cluster in terms of number of photons. Indeed,
this information gives an idea of the average number of
background photons contaminating the reconstructed cluster. In
the top left panel of Fig. 9 we show the scatter plot of N_p vs.
the number of simulated events (N_p^sim). The solid points
represent the average value of N_p for a given value of N_p^sim,
and the error bar corresponds to the standard deviation. The
solid green line represents the case N_p = N_p^sim, and the
dashed upper and lower lines represent N_p = N_p^sim ± 10,
respectively. For both ε = 0.15 deg. and ε = 0.20 deg., the
scatter is bounded by the dashed lines, showing that the largest
excess in N_p is about 10 photons, independently of N_p^sim. We
note that for ε = 0.15 deg. the number of reconstructed photons
systematically underestimates the simulated number, whilst the
ε = 0.20 deg. case does not show this bias. This effect is
better appreciated in the bottom left panel of Fig. 9, where we
show the fractional reconstruction error (N_p - N_p^sim)/N_p^sim
vs. N_p^sim. The solid green line represents the case with zero
error, and the dashed lines represent the ±20% boundaries. The
bias on N_p for ε = 0.15 deg. shows again the strong correlation
between ε and the PSF radius. When ε is smaller than σ_sim
(which in our simulations reproduces the PSF effect), the number
of reconstructed events N_p is systematically smaller than
N_p^sim; on the contrary, when ε matches the PSF radius
(ε = 0.20 deg.), the bias disappears.

Table 1. Summary of the detections obtained for all the five
sky fields, for detections with a number of fake sources < 6.
N_sim^cut is the number of simulated sources with a number of
simulated events larger than K; the number separated by the /
symbol indicates the full number of simulated sources.

6. Cluster significance, background inhomogeneities, and
rejection of spurious clusters

Even though we have identified the region of the K-ε parameter
space where the detection efficiency is larger and the
probability to detect fake clusters is lower, in the application
to real data it is mandatory to provide a significance level,
expressing the probability that a cluster does not originate
from a background fluctuation. We propose a method derived from
the Li & Ma (1983) approach, based on the evaluation of the
signal to noise (S/N) ratio. A significance method based on the
S/N ratio fits well the γ-ray DBSCAN implementation, because the
algorithm directly provides a partition of the photon list in
cluster and noise events. Hence, for each cluster we can easily
evaluate the S/N ratio, knowing the exact nature of each event.
The procedure to evaluate the significance is summarized by the
following items:



1. For each cluster, we define an annular region with an inner
radius r_in and an external radius r_out.

2. r_in is set to an initial value of r_in = 2 r_eff, and is
adaptively increased with a step of r_in/10, for a maximum of 10
trials, until at least 95% of the cluster events are enclosed
within r_in.

3. r_out is set to 3 r_in.

4. We count all the cluster events N_src^in and all the
background events N_bkg^in enclosed within the circle with
radius r_in centered on the cluster centroid.

5. We determine the N_bkg^out background level, rescaling the
number of background events in r_in < r < r_out to a circle with
radius r_in.

6. To evaluate possible gradients in the background, we select
a region far enough from the cluster to sample properly the
background level, and close enough to the cluster to measure a
local background level. To this end, we define the radius
r_out^ave = (r_out + r_in)/2, and we evaluate the average
background level (N_bkg^local) in a circle of radius r_in,
centered on each point in r_out^ave < r < r_out. If no
background points are found in r_out^ave < r < r_out, we set
N_bkg^local = N_bkg^out.

7. By comparing N_bkg^in to N_bkg^local, we evaluate the
fraction of noise already resolved by the γ-ray DBSCAN, and we
evaluate the effective background level N_bkg^eff by correcting
N_bkg^local.

8. We evaluate the significance according to the Likelihood
Ratio Test (LRT) method proposed by Li & Ma (1983):

$$
S_{\rm cls} = \sqrt{2}\,\left\{ N_{\rm src}^{\rm in} \ln\left[\frac{2 N_{\rm src}^{\rm in}}{N_{\rm src}^{\rm in} + N_{\rm bkg}^{\rm eff}}\right] + N_{\rm bkg}^{\rm eff} \ln\left[\frac{2 N_{\rm bkg}^{\rm eff}}{N_{\rm src}^{\rm in} + N_{\rm bkg}^{\rm eff}}\right] \right\}^{1/2} \qquad (10)
$$
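
Two ingredients of this procedure, the annulus rescaling and the likelihood-ratio significance, can be sketched as follows. This is a simplified illustration and not the full procedure: it skips the local background gradient estimate, it assumes flat-sky areas, and it assumes the Li & Ma (1983) expression for equal on-source and off-source exposure. Function names and input numbers are hypothetical.

```python
import math


def rescaled_annulus_background(n_annulus, r_in, r_out):
    """Background counts expected inside r_in, from counts in r_in < r < r_out.

    Flat-sky area scaling (assumption): circle area / annulus area.
    """
    return n_annulus * r_in**2 / (r_out**2 - r_in**2)


def cluster_significance(n_src, n_bkg):
    """Li & Ma (1983) likelihood-ratio significance for equal on/off exposure.

    Assumption: n_src and n_bkg play the roles of the on-source and
    background counts; returns 0 when there is no excess.
    """
    if n_src <= n_bkg:
        return 0.0
    total = n_src + n_bkg
    term_src = n_src * math.log(2.0 * n_src / total)
    # The background term vanishes in the limit n_bkg -> 0.
    term_bkg = n_bkg * math.log(2.0 * n_bkg / total) if n_bkg > 0 else 0.0
    return math.sqrt(2.0 * (term_src + term_bkg))


# Illustrative numbers: 20 cluster events, 40 background events in the annulus.
r_in = 0.4  # deg
# With r_out = 3 r_in the annulus covers 8 times the area of the inner circle.
n_bkg = rescaled_annulus_background(40, r_in, 3.0 * r_in)
s_cls = cluster_significance(20, n_bkg)
```

With these inputs the rescaled background is 5 events and the significance is about 3.1.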



Under the hypothesis that a cluster is due to a background
fluctuation, the variable S_cls^2 is expected to follow a
chi-square distribution with one degree of freedom (χ²(1)). In
the left panel of Fig. 10 we plot the distribution of S_cls^2
for the fake clusters in the sky test field 1 (blue histogram),
compared to a χ²(1) distribution. The empirical distribution is
well described by the expected χ²(1) distribution, proving that
the value of S_cls can be used as the "significance" of the
detected cluster. A very illustrative example of the power of
S_cls in rejecting fake clusters is given by the plot in the
right panel of Fig. 10, where we plot the D_fake ratio isolevel
map, applying the selection S_cls > 4.0. The fake ratio is 0 for
the parameter space with ε < 0.25 deg. For 0.25 deg. < ε < 0.35
deg., there are fluctuations showing D_fake ≃ 0.05. Only for
ε > 0.40 deg. and K < 8 does the fake ratio show a significant
increase, but we stress that in this region of the parameter
space ε is more than double the PSF size, hence this region
should not be used in the detection with real data.
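
If S_cls^2 follows a χ²(1) distribution for background fluctuations, as stated above, the chance probability associated with a significance cut has a closed form: the survival function of χ²(1) at S_cls^2 equals the two-sided Gaussian tail, erfc(S_cls/√2). A minimal sketch, with a hypothetical function name:

```python
import math


def chance_probability(s_cls):
    """P(chi2 with 1 d.o.f. > s_cls**2), i.e. the two-sided Gaussian tail."""
    return math.erfc(s_cls / math.sqrt(2.0))


for cut in (2.0, 4.0):
    print(f"S_cls > {cut}: chance probability {chance_probability(cut):.2e}")
```

Under this assumption a cut at S_cls > 4 corresponds to a chance probability of about 6.3e-5 per cluster, against about 4.6e-2 for a cut at S_cls > 2.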



7. Application to real Fermi-LAT data

The last step in our investigation of the γ-ray DBSCAN is the
application to real Fermi-LAT γ-ray data. We select the same
region of the sky used for the simulated test field
(80° < l < 170°, and 40° < b < 65°), and we extract all the
photons with energy E > 3 GeV. The photons are collected for the
same time span of the 2FGL catalog. We repeat the detection test
performed in the case of simulated data (see Sec. 5 and Sec. 6),
restricting the parameter space to 2 < K < 10, and
0.10 < ε < 0.30 deg.

Fig. 10. Left panel: the distribution (blue line) of the square of the significance, for the fake clusters in the sky test field 1, for
the full K, ε parameter space, compared to a χ² distribution with one degree of freedom. Right panel: the spurious ratio D_fake for
S_cls > 4.0; the white line shows the isolevel D_fake = 0.0.

Fig. 11. Aitoff projection of the Fermi sky region. The purple boxes represent the γ-ray DBSCAN sources (K = 8, ε = 0.21 deg.).
The green crosses are the 2FGL sources with TS > 16, the red ones those with TS < 16. There are no fake sources, and the γ-ray
DBSCAN finds all the sources with TS > 16, except only one, enclosed by the red circle, and with the center positioned at the
edge of the field.

To properly understand the detection performance, we need to
take into account that the 2FGL catalog has been built using
photons with an energy threshold of 100 MeV, whilst we use a
value of 3 GeV. One possibility is to select sources with a
reported flux larger than zero in the 3-10 GeV band flux column
of the 2FGL. This flux-based selection is not the best way to
study the detection performance of the γ-ray DBSCAN, since the
flux has no unambiguous relation with the significance of the
detection for that energy threshold. A more reliable criterion
is to select the sources according to the significance reported
in the 2FGL. The 2FGL detection significance is given by √TS.
The TS is the test statistic defined as
TS = 2(log L(source) - log L(no source)), where L is the
likelihood of the data given the model with or without a source
present at a given position on the sky (Nolan et al. 2012). We
apply a selection according to √TS > 4, and we refer to the
corresponding source list (counting 35 sources) as 2FGL_TS>16.

Fig. 12. The scatter plot of the positional error of the γ-ray
DBSCAN clusters vs. the positional error of the corresponding
associated 2FGL_TS>16 sources. For each 2FGL_TS>16 source
associated to one or more γ-ray DBSCAN clusters, we plot the
error on the position of the reconstructed cluster centroid and
its standard deviation (represented by the error bar). The dashed
red line represents a linear best fit with a slope of ≃ 0.99, and an
intercept of ≃ 9.53.

Fig. 13. D_fake (left panels) and D_eff (right panels), for the real sky detections, using the 2FGL_TS>16 catalog. Panels a,b: no cut on
S_cls applied. Panels c,d: S_cls > 2.0. Panels e,f: S_cls > 4.0.

Fig. 14. Left panel: scatter plot of S_cls vs. √TS. For each source in our 2FGL_TS>16 list, associated to one or more γ-ray
DBSCAN clusters, we plot the √TS in the 3-10 GeV band vs. the average value of S_cls and its standard deviation (represented by
the error bar). Right panel: the distribution (blue line) of the square of the significance, for the fake clusters in the Fermi-LAT real
sky, for the full K, ε parameter space, compared to a χ² distribution with one degree of freedom.

An example of the application of the γ-ray DBSCAN to real
Fermi-LAT data is given in Fig. 11, where we report an Aitoff
projection in galactic coordinates of the analysed γ-ray sky
region. The red crosses represent the 2FGL sources with TS < 16
in the 3-10 GeV band, and the green crosses represent those with
TS > 16. The purple boxes represent the γ-ray DBSCAN sources
found for K = 8, ε = 0.21 deg. For this choice of parameters we
find no fake sources, and we find all the sources with TS > 16
except one, enclosed by the red circle and positioned at the
edge of the sky region, with a galactic latitude b = 64.85 deg.
In Tab. 2 we summarize the detection performance, for detections
with a number of fake sources ≤ 4. We note that the number of
true clusters ranges between 34 and 35, out of the 35 present in
the 2FGL_TS>16 list. The fake ones range between 0 and 4, and we
obtain an average detection efficiency of D_eff = 0.94.

In Fig. 12 we compare the localization performance of the γ-ray
DBSCAN algorithm with that returned by the likelihood analysis
implemented in the Fermi Science Tools. For each source in our
2FGL_TS>16 list, associated to one or more γ-ray DBSCAN
clusters, we plot the error on the position of the reconstructed
cluster centroid and its standard deviation (represented by the
error bar) vs. the 95% positional uncertainty reported in the
2FGL. We evaluate the 2FGL 95% positional uncertainty as
√(σ_95,maj σ_95,min), where σ_95,maj and σ_95,min are the
semimajor and semiminor axes of the 95% confidence source
location region, respectively. The dashed red line represents a
linear best fit, with a slope of ≃ 0.99 and an intercept of
≃ 9.53, showing that the error on the position of the
reconstructed cluster centroid, obtained with a threshold of 3
GeV, is of the same order as the 95% positional uncertainty
reported in the 2FGL catalog, obtained above 100 MeV.
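
The catalog selection and the single-number 95% uncertainty used in this comparison reduce to two short operations. In the sketch below the catalog rows, their keys, and their values are hypothetical placeholders, not 2FGL entries or 2FGL column names.

```python
import math

# Hypothetical catalog rows; keys and values are illustrative, not 2FGL entries.
catalog = [
    {"name": "src_A", "ts": 25.0, "semi_major_95": 0.10, "semi_minor_95": 0.08},
    {"name": "src_B", "ts": 9.0, "semi_major_95": 0.20, "semi_minor_95": 0.15},
]


def r95(row):
    """Single 95% radius: geometric mean of the two semi-axes (deg)."""
    return math.sqrt(row["semi_major_95"] * row["semi_minor_95"])


# sqrt(TS) > 4 is equivalent to TS > 16.
selected = [row for row in catalog if math.sqrt(row["ts"]) > 4.0]
radii = {row["name"]: r95(row) for row in selected}
```

The geometric mean gives the radius of the circle with the same area as the 95% confidence ellipse.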

To test the reliability of the significance S_cls in rejecting
spurious sources, in Fig. 13 we plot D_fake and D_eff, based on
the 2FGL_TS>16 catalog. Panels a and b correspond to the case of
no selection on S_cls. Both the D_fake and the D_eff trends are
very similar to the case of the simulated sky. If we apply a
significance cut of S_cls > 2.0 (panels c,d), we observe that
the spurious ratio is D_fake < 0.05 for almost half of the
parameter space (region to the right of the purple line). The
more severe cut of S_cls > 4.0 (panels e,f) removes all the fake
clusters except two, for ε < 0.15 deg. Only for ε > 0.25 deg.
does the D_fake ratio show a significant increase, ranging from
0.05 up to ≃ 0.1. In agreement with our analysis on simulated
data, the region of the parameter space where ε is comparable to
the PSF size gives the best performance.

            2FGL_TS>16   True    Fake   K      ε (deg)   D_eff   Q
Fermi-sky   35           34      0      8      0.21      0.97    0.97
            35           34      1      7      0.20      0.94    0.92
            35           35      2      7      0.21      0.94    0.89
            35           35      4      6      0.19      0.89    0.79
average     35           34.50   1.75   7.00   0.20      0.94    0.89

Table 2. Summary of the detection performance for the real
Fermi-LAT field, for detections with a number of fake sources
≤ 4.

As a further confirmation of the robustness of our significance,
we plot in the left panel of Fig. 14 S_cls vs. √TS. For each
source in our 2FGL_TS>16 list, associated to one or more γ-ray
DBSCAN clusters, we plot the √TS in the 3-10 GeV band vs. the
average value of S_cls and its standard deviation (represented
by the error bar). The average value of S_cls and its standard
deviation are evaluated from the list of all the clusters
associated to the same 2FGL source. The solid blue boxes
represent the full K, ε parameter space case, and the red solid
circles represent the ε = 0.10 deg. case. The dashed black line
represents a linear best fit. The slope of the linear fit is
≃ 0.5. The strong correlation in the scatter plots (r ≃ 0.98,
for both data sets) shows that our significance implementation
is consistent with the √TS reported in the 2FGL, and the slope
of the linear fit suggests that S_cls ≃ 0.5 √TS.



8. Conclusions 

For the first time, we have used the DBSCAN for the detection of
sources in γ-ray astrophysical images. We have implemented a
new version of the DBSCAN, the γ-ray DBSCAN, that is optimized
for the application to γ-ray astrophysical images with relevant
background noise. Our γ-ray DBSCAN presents the novelty of a
recursive call of the DBSCAN algorithm, which allows an
excellent reconstruction of the cluster, with an effective
background rejection. We have tested the algorithm with a sample
of simulated γ-ray Fermi-LAT fields, to give a statistical
characterization of the method and to benchmark the detection
performance. The results with the simulated γ-ray data are
summarized by the following items:

- The radius of the γ-ray DBSCAN scanning brush, ε, has a
strong correlation with the instrumental PSF radius. We find
that the typical size of the reconstructed true cluster is of
the order of the simulated PSF size σ_sim, and that the
precision of the reconstructed centroid is of the order of
σ_sim/√N_p.

- The number of reconstructed events N_p is ruled by the
Poissonian statistics in the random fields, and for the fake
clusters. On the contrary, for true clusters, the statistics of
N_p is ruled by that of the simulated sources.

- The fractional error on the number of reconstructed events is
of the order of 20% for N_p^sim < 50, and is negligible for
larger values, with the best performance obtained when
ε ≃ σ_sim.

- We have investigated the detection performance for a wide
range of the K, ε parameter space, and we have identified the
region with the best performance in terms of detection
efficiency and spurious ratio.

- We have implemented an algorithm for the estimate of the
Signal to Noise (S/N) ratio, able to deal with local background
inhomogeneities and nearby sources contamination, and we have
successfully used the S/N estimate to determine the significance
of the clusters, using the definition in Li & Ma (1983).

- The square of our cluster significance, S_cls^2, for random
clusters follows the χ²(1) statistics, and S_cls can be used to
reject spurious sources. The chance to find spurious sources for
S_cls > 4 is negligible. This means that our S_cls is a robust
and reliable tool to reject spurious sources, and that the χ²(1)
statistics can be used to evaluate the probability of a cluster
being spurious.

We have successfully applied the γ-ray DBSCAN to real
Fermi-LAT data. We have found an excellent agreement with the
results from the simulated fields. We tested our detection
performance using as reference catalog the 2FGL sources with a
√TS > 4 cut. The results with the real Fermi-LAT γ-ray data are
summarized by the following items:

- The error on the position of the reconstructed cluster
centroid, obtained with a threshold of 3 GeV, is of the same
order as the 95% positional uncertainty reported in the 2FGL,
obtained above 100 MeV.

- We tested the γ-ray DBSCAN significance, finding that it is
strongly correlated with the TS provided in the 2FGL. The
significance cut allows us to safely remove spurious clusters.

- The detection efficiency with real data is excellent: we are
able to find all the 35 sources with √TS > 4.

- When working with ε of the order of the instrumental PSF
size, we obtain the best performance in terms of spurious
rejection and detection efficiency.

In general, we find that the γ-ray DBSCAN is a very powerful
detection method to find clusters in γ-ray images corresponding
to real sources. It has the great advantage of dealing
self-consistently with gradients in the background, providing an
effective rejection of spurious clusters. Our implementation of
the detection significance, together with the algorithm to
evaluate local fluctuations in the background, allows us to
apply a statistically significant selection, making the
rejection of spurious sources even more effective.



In a companion paper (Tramacere 2012), we will apply the method
to the Fermi-LAT sky, showing its potential for the discovery of
new sources, in particular of small clusters located at high
galactic latitude, or clusters on the galactic plane, affected
by a strong background. We will also investigate how to plug the
energy dependence of the PSF into the γ-ray DBSCAN algorithm,
and how to improve the detection performance taking into account
other Fermi-LAT calibration properties.

We remark that, since the γ-ray DBSCAN also provides density
maps, it can potentially be used in the detection of large scale
structures in the galactic γ-ray background, providing patterns
to compare to the interstellar gas distribution. We also stress
that the applications of this method are not limited to γ-ray
images, but it can potentially be used for any application
related to the detection of spatial and/or spatio-temporal
clusters.